Feature Selection, Model Selection, And Tuning Project

Github: https://github.com/cmelende/FeatureAndModelSelectionAndTuningProject.git

Cory Melendez

8/28/2020

In [1]:
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from univariateAnalysis import UniVariateAnalysis, UniVariateReport, OutlierFilter
from metrics import Metrics
from sklearn.metrics import accuracy_score


sns.set(rc={'figure.figsize':(11.7,8.27)})
concrete_df = pd.read_csv('data/concrete.csv')

columns = ['cement', 'slag', 'ash', 'water', 'superplastic', 'coarseagg', 'fineagg', 'age']
targetColumn = 'strength'

print_graph = True

def print_all_uni_analysis_reports(df, columnNames):
    """Print a univariate analysis report for each column in columnNames."""
    separator = '---------------------------------------------'
    for column in columnNames:
        analysis = UniVariateAnalysis(df, column)
        analysis_report = UniVariateReport(analysis)

        print(separator)
        print(f"'{column}' column univariate analysis report")
        print(separator)

        analysis_report.print_report()

1. Univariate Analysis

In [2]:
print_all_uni_analysis_reports(df = concrete_df, columnNames=columns)
---------------------------------------------
'cement' column univariate analysis report
---------------------------------------------
Data type:  float64
Range of values: (102.0, 540.0)
Standard deviation:  104.50636449481532
Q1:  192.375
Q2:  272.9
Q3:  350.0
Q4:  540.0
Mean:  281.16786407766995
Min:  102.0
Median:  272.9
Max:  540.0
Top whisker:  586.4375
Bottom whisker:  -44.0625
Number of outliers above the top whisker:  0
Number of outliers below the bottom whisker:  0
---------------------------------------------
'slag' column univariate analysis report
---------------------------------------------
Data type:  float64
Range of values: (0.0, 359.4)
Standard deviation:  86.27934174810582
Q1:  0.0
Q2:  22.0
Q3:  142.95
Q4:  359.4
Mean:  73.89582524271844
Min:  0.0
Median:  22.0
Max:  359.4
Top whisker:  357.375
Bottom whisker:  -214.42499999999998
Number of outliers above the top whisker:  2
Indices of higher outlier rows
1) 918
2) 990
Number of outliers below the bottom whisker:  0
---------------------------------------------
'ash' column univariate analysis report
---------------------------------------------
Data type:  float64
Range of values: (0.0, 200.1)
Standard deviation:  63.99700415268765
Q1:  0.0
Q2:  0.0
Q3:  118.3
Q4:  200.1
Mean:  54.18834951456311
Min:  0.0
Median:  0.0
Max:  200.1
Top whisker:  295.75
Bottom whisker:  -177.45
Number of outliers above the top whisker:  0
Number of outliers below the bottom whisker:  0
---------------------------------------------
'water' column univariate analysis report
---------------------------------------------
Data type:  float64
Range of values: (121.8, 247.0)
Standard deviation:  21.35421856503247
Q1:  164.9
Q2:  185.0
Q3:  192.0
Q4:  247.0
Mean:  181.56728155339806
Min:  121.8
Median:  185.0
Max:  247.0
Top whisker:  232.64999999999998
Bottom whisker:  124.25000000000001
Number of outliers above the top whisker:  4
Indices of higher outlier rows
1) 66
2) 263
3) 740
4) 826
Number of outliers below the bottom whisker:  5
Indices of bottom outlier rows v3
1)     432
2)     462
3)     587
4)     789
5)     914
---------------------------------------------
'superplastic' column univariate analysis report
---------------------------------------------
Data type:  float64
Range of values: (0.0, 32.2)
Standard deviation:  5.97384139248552
Q1:  0.0
Q2:  6.4
Q3:  10.2
Q4:  32.2
Mean:  6.204660194174758
Min:  0.0
Median:  6.4
Max:  32.2
Top whisker:  25.5
Bottom whisker:  -15.299999999999999
Number of outliers above the top whisker:  10
Indices of higher outlier rows
1) 44
2) 156
3) 232
4) 292
5) 538
6) 744
7) 816
8) 838
9) 955
10) 1026
Number of outliers below the bottom whisker:  0
---------------------------------------------
'coarseagg' column univariate analysis report
---------------------------------------------
Data type:  float64
Range of values: (801.0, 1145.0)
Standard deviation:  77.75395396672077
Q1:  932.0
Q2:  968.0
Q3:  1029.4
Q4:  1145.0
Mean:  972.9189320388349
Min:  801.0
Median:  968.0
Max:  1145.0
Top whisker:  1175.5000000000002
Bottom whisker:  785.8999999999999
Number of outliers above the top whisker:  0
Number of outliers below the bottom whisker:  0
---------------------------------------------
'fineagg' column univariate analysis report
---------------------------------------------
Data type:  float64
Range of values: (594.0, 992.6)
Standard deviation:  80.17598014240437
Q1:  730.9499999999999
Q2:  779.5
Q3:  824.0
Q4:  992.6
Mean:  773.5804854368931
Min:  594.0
Median:  779.5
Max:  992.6
Top whisker:  963.575
Bottom whisker:  591.3749999999998
Number of outliers above the top whisker:  5
Indices of higher outlier rows
1) 129
2) 447
3) 504
4) 584
5) 857
Number of outliers below the bottom whisker:  0
---------------------------------------------
'age' column univariate analysis report
---------------------------------------------
Data type:  int64
Range of values: (1, 365)
Standard deviation:  63.16991158103249
Q1:  7.0
Q2:  28.0
Q3:  56.0
Q4:  365.0
Mean:  45.662135922330094
Min:  1
Median:  28.0
Max:  365
Top whisker:  129.5
Bottom whisker:  -66.5
Number of outliers above the top whisker:  59
Indices of higher outlier rows
1) 51
2) 64
3) 93
4) 99
5) 103
6) 133
7) 144
8) 149
9) 152
10) 157
11) 159
12) 198
13) 199
14) 207
15) 256
16) 262
17) 270
18) 297
19) 302
20) 312
21) 313
22) 323
23) 359
24) 361
25) 370
26) 393
27) 448
28) 465
29) 484
30) 539
31) 570
32) 581
33) 594
34) 601
35) 620
36) 622
37) 623
38) 632
39) 642
40) 696
41) 713
42) 720
43) 721
44) 754
45) 755
46) 776
47) 850
48) 861
49) 878
50) 900
51) 901
52) 919
53) 951
54) 957
55) 971
56) 985
57) 995
58) 1017
59) 1028
Number of outliers below the bottom whisker:  0
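
The whisker values in the reports above are consistent with Tukey's 1.5 x IQR rule. For cement, 350.0 + 1.5 x (350.0 - 192.375) = 586.4375, which matches the reported top whisker. The sketch below shows that calculation on toy data. It is not the code of UniVariateAnalysis, whose source is not shown here; the helper names `whiskers` and `outlier_indices` are hypothetical. It uses only the standard library, with `method="inclusive"` so that the quartiles are linearly interpolated the way pandas does by default.

```python
from statistics import quantiles

def whiskers(values, k=1.5):
    """Return (bottom, top) Tukey fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def outlier_indices(values, k=1.5):
    """Positions of values that fall strictly outside the fences."""
    bottom, top = whiskers(values, k)
    return [i for i, v in enumerate(values) if v < bottom or v > top]

# Toy data (not taken from concrete.csv): one value far above the rest.
sample = [1, 7, 28, 28, 56, 90, 365]
print(whiskers(sample))
print(outlier_indices(sample))
```

A negative bottom whisker, as seen for most columns above, simply means that no value can be a low outlier when the column cannot go below zero.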

2. Bivariate Analysis

Cement

There is a strong, mostly positive relationship between cement and strength. The line does oscillate, but the dips may be attributable to other factors (for example, those mixes may have contained more water, which weakened them).
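
One way to put a number on "strong" or "weak" relationships like the ones described in this section is Pearson's correlation coefficient. The sketch below computes it by hand on made-up values. The function name `pearson_r` and the sample numbers are illustrative assumptions, not results from concrete.csv.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up values for illustration only.
cement = [100.0, 200.0, 300.0, 400.0, 500.0]
strength = [12.0, 25.0, 31.0, 48.0, 55.0]
r = pearson_r(cement, strength)
print(round(r, 3))
```

On the real dataframe, `concrete_df['cement'].corr(concrete_df['strength'])` should give the same statistic. A value near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates a weak one.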

In [3]:
x_column_name = 'cement'
analysis = UniVariateAnalysis(concrete_df, x_column_name)
df_no_outlier = analysis.get_df_without_outliers_on_column()
In [4]:
if print_graph:
    sns.lineplot(x=x_column_name, y=targetColumn, data=concrete_df)
[Figure: line plot of strength vs. cement, all rows]
In [5]:
if print_graph:
    sns.jointplot(x=x_column_name, y=targetColumn, data=concrete_df, kind='reg')
[Figure: joint plot with regression fit of strength vs. cement, all rows]
In [6]:
if print_graph:
    sns.lineplot(x=x_column_name, y=targetColumn, data=df_no_outlier)
[Figure: line plot of strength vs. cement, cement outliers removed]
In [7]:
if print_graph:
    sns.jointplot(x=x_column_name, y=targetColumn, data=df_no_outlier, kind='reg')
[Figure: joint plot with regression fit of strength vs. cement, cement outliers removed]

Slag

There is only a weak relationship between slag and the target column; the data points are widely scattered. I suspected that outliers were pulling the fit in different directions, but removing the outliers for this column changed little. This column is a good candidate for further investigation in conjunction with other columns.

In [8]:
x_column_name = 'slag'
analysis = UniVariateAnalysis(concrete_df, x_column_name)
df_no_outlier = analysis.get_df_without_outliers_on_column()
In [9]:
if print_graph:
    sns.lineplot(x=x_column_name, y=targetColumn, data=concrete_df)
[Figure: line plot of strength vs. slag, all rows]
In [10]:
if print_graph:
    sns.jointplot(x=x_column_name, y=targetColumn, data=concrete_df, kind='reg')
[Figure: joint plot with regression fit of strength vs. slag, all rows]
In [11]:
if print_graph:
    sns.lineplot(x=x_column_name, y=targetColumn, data=df_no_outlier)
[Figure: line plot of strength vs. slag, slag outliers removed]
In [12]:
if print_graph:
    sns.jointplot(x=x_column_name, y=targetColumn, data=df_no_outlier, kind='reg')
[Figure: joint plot with regression fit of strength vs. slag, slag outliers removed]

Ash

Ash seems to have a slight negative relationship with the target, even with outliers removed. A little domain research suggests that fly ash should have a positive relationship with concrete strength, so it is interesting that this data does not support that idea. Perhaps another column, rather than ash itself, is driving the relationship in a negative direction. This column should be considered together with other columns.

In [13]:
x_column_name = 'ash'
analysis = UniVariateAnalysis(concrete_df, x_column_name)
df_no_outlier = analysis.get_df_without_outliers_on_column()
In [14]:
if print_graph:
    sns.lineplot(x=x_column_name, y=targetColumn, data=concrete_df)
[Figure: line plot of strength vs. ash, all rows]
In [15]:
if print_graph:
    sns.jointplot(x=x_column_name, y=targetColumn, data=concrete_df, kind='reg')
[Figure: joint plot with regression fit of strength vs. ash, all rows]
In [16]:
if print_graph:
    sns.lineplot(x=x_column_name, y=targetColumn, data=df_no_outlier)
[Figure: line plot of strength vs. ash, ash outliers removed]
In [17]:
if print_graph:
    sns.jointplot(x=x_column_name, y=targetColumn, data=df_no_outlier, kind='reg')
[Figure: joint plot with regression fit of strength vs. ash, ash outliers removed]

Water

There is a fairly strong negative relationship between water and the target variable. This makes sense for the domain: adding more water dilutes the mixture, which could produce weaker concrete.

In [18]:
x_column_name = 'water'
analysis = UniVariateAnalysis(concrete_df, x_column_name)
df_no_outlier = analysis.get_df_without_outliers_on_column()
In [19]:
if print_graph:
    sns.lineplot(x=x_column_name, y=targetColumn, data=concrete_df)
[Figure: line plot of strength vs. water, all rows]
In [20]:
if print_graph:
    sns.jointplot(x=x_column_name, y=targetColumn, data=concrete_df, kind='reg')
[Figure: joint plot with regression fit of strength vs. water, all rows]
In [21]:
if print_graph:
    sns.lineplot(x=x_column_name, y=targetColumn, data=df_no_outlier)
[Figure: line plot of strength vs. water, water outliers removed]
In [22]:
if print_graph:
    sns.jointplot(x=x_column_name, y=targetColumn, data=df_no_outlier, kind='reg')
[Figure: joint plot with regression fit of strength vs. water, water outliers removed]

Superplastic

There is a strong positive relationship here; the plots suggest that the superplastic column is likely correlated with the target column.

In [23]:
x_column_name = 'superplastic'
analysis = UniVariateAnalysis(concrete_df, x_column_name)
df_no_outlier = analysis.get_df_without_outliers_on_column()
In [24]:
if print_graph:
    sns.lineplot(x=x_column_name, y=targetColumn, data=concrete_df)
[Figure: line plot of strength vs. superplastic, all rows]
In [25]:
if print_graph:
    sns.jointplot(x=x_column_name, y=targetColumn, data=concrete_df, kind='reg')
[Figure: joint plot with regression fit of strength vs. superplastic, all rows]
In [26]:
if print_graph:
    sns.lineplot(x=x_column_name, y=targetColumn, data=df_no_outlier)
[Figure: line plot of strength vs. superplastic, superplastic outliers removed]
In [27]:
if print_graph:
    sns.jointplot(x=x_column_name, y=targetColumn, data=df_no_outlier, kind='reg')
[Figure: joint plot with regression fit of strength vs. superplastic, superplastic outliers removed]

Coarseagg

Although the best-fit line for this column suggests a negative relationship, I am unconvinced: the graph oscillates quite a bit, which may be what is throwing the fit off. I would be surprised if this column had a large effect on the strength of concrete.

In [28]:
x_column_name = 'coarseagg'
analysis = UniVariateAnalysis(concrete_df, x_column_name)
df_no_outlier = analysis.get_df_without_outliers_on_column()
In [29]:
if print_graph:
    sns.lineplot(x=x_column_name, y=targetColumn, data=concrete_df)
[Figure: line plot of strength vs. coarseagg, all rows]
In [30]:
if print_graph:
    sns.jointplot(x=x_column_name, y=targetColumn, data=concrete_df, kind='reg')
[Figure: joint plot with regression fit of strength vs. coarseagg, all rows]
In [31]:
if print_graph:
    sns.lineplot(x=x_column_name, y=targetColumn, data=df_no_outlier)
[Figure: line plot of strength vs. coarseagg, coarseagg outliers removed]
In [32]:
if print_graph:
    sns.jointplot(x=x_column_name, y=targetColumn, data=df_no_outlier, kind='reg')
[Figure: joint plot with regression fit of strength vs. coarseagg, coarseagg outliers removed]

Fineagg

This is very similar to the column above: the best-fit line seems to show a negative relationship. However, the graph oscillates quite a bit here too and, as with coarseagg, there is no steady decrease in either the peaks or the valleys of the graph.

In [33]:
x_column_name = 'fineagg'
analysis = UniVariateAnalysis(concrete_df, x_column_name)
df_no_outlier = analysis.get_df_without_outliers_on_column()
In [34]:
if print_graph:
    sns.lineplot(x=x_column_name, y=targetColumn, data=concrete_df)
[Figure: line plot of strength vs. fineagg, all rows]
In [35]:
if print_graph:
    sns.jointplot(x=x_column_name, y=targetColumn, data=concrete_df, kind='reg')
[Figure: joint plot with regression fit of strength vs. fineagg, all rows]
In [36]:
if print_graph:
    sns.lineplot(x=x_column_name, y=targetColumn, data=df_no_outlier)
[Figure: line plot of strength vs. fineagg, fineagg outliers removed]
In [37]:
if print_graph:
    sns.jointplot(x=x_column_name, y=targetColumn, data=df_no_outlier, kind='reg')
[Figure: joint plot with regression fit of strength vs. fineagg, fineagg outliers removed]

Age

This one is interesting: it suggests the possibility of a strong positive relationship between age and the target, and the relationship remains when the outliers are removed. Conceptually, in terms of the domain, it makes sense that concrete would have a maximum life, beyond which older samples become less frequent. This column would be a very good candidate when considering which columns affect the target.

In [38]:
x_column_name = 'age'
analysis = UniVariateAnalysis(concrete_df, x_column_name)
df_no_outlier = analysis.get_df_without_outliers_on_column()
In [39]:
if print_graph:
    sns.lineplot(x=x_column_name, y=targetColumn, data=concrete_df)
[Figure: line plot of strength vs. age, all rows]
In [40]:
if print_graph:
    sns.jointplot(x=x_column_name, y=targetColumn, data=concrete_df, kind='reg')
[Figure: joint plot with regression fit of strength vs. age, all rows]
In [41]:
if print_graph:
    sns.lineplot(x=x_column_name, y=targetColumn, data=df_no_outlier)
[Figure: line plot of strength vs. age, age outliers removed]
In [42]:
if print_graph:
    sns.jointplot(x=x_column_name, y=targetColumn, data=df_no_outlier, kind='reg')
[Figure: joint plot with regression fit of strength vs. age, age outliers removed]

3. Feature Engineering Techniques

Context

After looking at the dataset, I noticed that there were quite a few zeros in several columns across many rows. To understand the domain (or context), I researched whether these zero values are valid. Fortunately, the zeros that occur in ['slag', 'ash', 'superplastic'] do seem to be valid; many websites describe this as normal. In a production environment, I would delegate this research to a product person who is more familiar with the domain and could tell me whether the values I am seeing are valid.
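
Before deciding whether zeros are valid, it helps to see how common they are. The univariate reports already hint at the scale: the median of ash is 0.0, so at least half of its rows are zero. A minimal sketch of the count follows, using a small made-up table in place of concrete.csv; the helper name `zero_fraction` is hypothetical.

```python
def zero_fraction(rows, column):
    """Fraction of rows whose value in `column` is exactly zero."""
    values = [row[column] for row in rows]
    return sum(1 for v in values if v == 0) / len(values)

# Made-up mixes: a zero means the ingredient was left out of the mix.
rows = [
    {"slag": 0.0, "ash": 0.0, "superplastic": 0.0},
    {"slag": 142.5, "ash": 0.0, "superplastic": 6.4},
    {"slag": 0.0, "ash": 118.3, "superplastic": 10.2},
    {"slag": 22.0, "ash": 0.0, "superplastic": 0.0},
]
for column in ("slag", "ash", "superplastic"):
    print(column, zero_fraction(rows, column))
```

On the real dataframe, the equivalent one-liner would be `(concrete_df[['slag', 'ash', 'superplastic']] == 0).mean()`.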

A few of the graphs had long tails, so I applied a filter to the dataframe that removes the rows containing outliers. The pair plots below compare the data before and after the filter.

In [43]:
outlier_filter = OutlierFilter(concrete_df, columns)
df_no_outliers = outlier_filter.get_df_without_outliers()
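
The source of OutlierFilter is not shown in this notebook. As a rough sketch of what such a filter might do, assuming it drops every row that lies outside the 1.5 x IQR whiskers in at least one of the listed columns, the hypothetical functions below apply that rule to made-up rows using only the standard library.

```python
from statistics import quantiles

def fences(values, k=1.5):
    """Lower and upper limits: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = k * (q3 - q1)
    return q1 - spread, q3 + spread

def drop_outlier_rows(rows, columns):
    """Keep only rows that sit inside the fences of every listed column."""
    limits = {c: fences([row[c] for row in rows]) for c in columns}
    return [
        row for row in rows
        if all(limits[c][0] <= row[c] <= limits[c][1] for c in columns)
    ]

# Made-up rows, not taken from concrete.csv.
rows = [
    {"water": 180.0, "age": 28},
    {"water": 185.0, "age": 7},
    {"water": 190.0, "age": 56},
    {"water": 182.0, "age": 365},  # age far above the others
    {"water": 188.0, "age": 14},
]
kept = drop_outlier_rows(rows, ["water", "age"])
print(len(rows), "->", len(kept))
```

In this sketch the limits are computed once from the unfiltered data. Filtering one column at a time and recomputing the limits after each pass would generally remove more rows, so the two approaches are not interchangeable.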
In [44]:
if print_graph:
    sns.pairplot(concrete_df, diag_kind='kde')
[Figure: pair plot of all columns with KDE diagonals, all rows]
In [45]:
if print_graph:
    sns.pairplot(df_no_outliers, diag_kind='kde')
[Figure: pair plot of all columns with KDE diagonals, outlier rows removed]

Outliers vs No Outliers

With the outliers removed, the distribution of age smoothed out a little.

Another technique that we could employ is splitting the dataset based on the valleys in the columns [slag, ash, superplastic] and possibly [age].
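
A minimal sketch of that splitting idea, assuming the valley for an ingredient column lies between the mixes that omit the ingredient (value 0) and those that include it. The function name `split_at` and the threshold are hypothetical; a real threshold would be read off the KDE curves in the pair plots.

```python
def split_at(rows, column, threshold):
    """Partition rows into (at or below threshold, above threshold)."""
    low = [row for row in rows if row[column] <= threshold]
    high = [row for row in rows if row[column] > threshold]
    return low, high

# Made-up rows for illustration only.
rows = [
    {"ash": 0.0, "strength": 40.1},
    {"ash": 0.0, "strength": 35.3},
    {"ash": 118.3, "strength": 29.9},
    {"ash": 141.0, "strength": 33.0},
]
without_ash, with_ash = split_at(rows, "ash", threshold=0.0)
print(len(without_ash), len(with_ash))
```

Each subset could then be analysed or modelled on its own, at the cost of having fewer rows per model.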

The logic for splitting the data into test and training sets is in the constructor of the following object.

In [46]:
metrics = Metrics(df_no_outliers, 30, 'strength')
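
The Metrics class is not shown here, so the meaning of its second argument is an assumption: the sketch below treats 30 as the percentage of rows held out for testing. The function `holdout_split` is hypothetical and uses the standard library as a stand-in for a library splitter.

```python
import random

def holdout_split(rows, test_percent, seed=1):
    """Shuffle a copy of the rows and hold out test_percent of them."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_percent / 100)
    # First n_test shuffled rows become the test set, the rest train.
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))  # stand-in for dataframe rows
train, test = holdout_split(rows, test_percent=30)
print(len(train), len(test))
```

With scikit-learn, which this notebook imports in its first cell, the usual call would be `train_test_split(X, y, test_size=0.30, random_state=1)`. Fixing the seed keeps the split, and therefore the scores, reproducible between runs.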
In [47]:
metrics.GetScoreDataframe()
Out[47]:
          score     explained variance  mean abs error  mse        mean squared log error  median abs error  r2
gradient  0.846178  0.846223            4.626711        42.688742  0.034253                3.641198          0.846178
boosting  0.692756  0.692832            7.167875        85.266344  0.082504                6.581411          0.692756
bagging   0.861469  0.861501            3.988690        38.445062  0.028659                2.752700          0.861469
dtree     0.755505  0.755705            5.288433        67.852413  0.055370                3.035000          0.755505
linear    0.726373  0.726658            6.626916        75.937117  0.065618                5.276490          0.726373
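
For reference, here is a minimal sketch of how three of the columns in the table above are defined, written out by hand on made-up numbers. The function names are hypothetical; scikit-learn provides equivalents in sklearn.metrics (`mean_absolute_error`, `mean_squared_error`, `r2_score`). The score and r2 columns are identical in every row, which is what one would expect if score comes from a regressor's `score` method, since that method returns R² for scikit-learn regressors.

```python
def mean_abs_error(y_true, y_pred):
    """Average absolute difference between truth and prediction."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Average squared difference; penalises large misses more heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up strengths and predictions, not model output.
y_true = [20.0, 30.0, 40.0, 50.0]
y_pred = [22.0, 28.0, 41.0, 49.0]
print(mean_abs_error(y_true, y_pred))
print(mean_squared_error(y_true, y_pred))
print(r2(y_true, y_pred))
```

Read this way, the bagging row has both the highest r2 and the lowest errors in the table, while the boosting row has the lowest r2.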